1 Background

As a statistical consultant working for a real estate investment firm, your task is to develop a model to predict the selling price of a given home in Ames, Iowa. Your employer hopes to use this information to help assess whether the asking price of a house is higher or lower than the true value of the house. If the home is undervalued, it may be a good investment for the firm.

2 Training Data and relevant packages

In order to better assess the quality of the model you will produce, the data have been randomly divided into three separate pieces: a training data set, a testing data set, and a validation data set. For now we will load the training data set, the others will be loaded and used later.

load("ames_train.Rdata")

Use the code block below to load any necessary packages

library(statsr)
library(PairedData)
library(dplyr)
library(ggplot2)
library(devtools)
library(MASS)
library(BAS)

2.1 Part 1 - Exploratory Data Analysis (EDA)

Let’s check some relations between variables which we can make use of in the later stages.

2.1.1 Part 1.1 : Age vs Price

We will create a new variable age, as we will use in many cases.

ames_train <- ames_train %>% mutate(age=2018-Year.Built)

As normally we all think what is the relation between price and age

ggplot(data=ames_train, aes(x=age)) + geom_histogram(aes(y =..density..), bins=30) + geom_density(col="red") +
  geom_vline(aes(xintercept = mean(age)),col='red')+
  geom_vline(aes(xintercept = median(age)),col='blue')+
  geom_text(data = ames_train, aes( x = (mean(ames_train$age)), y = .015, label = 'mean', col='red'), size = 3, parse = T) +
  geom_text(data = ames_train,aes( x = (median(ames_train$age)),y = .020,  label = 'median'), col='blue', size = 3, parse = T)

Plot shows the multimodal distribution and the right skewness. It means that we may need to use robust statistics.

summary(ames_train$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     8.0    17.0    43.0    45.8    63.0   146.0

Above summary statistics show that 50% of the houses are aged between 17 and 63 years. And 50% of houses are less than 43 years old.

summary(ames_train$price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12789  129762  159467  181190  213000  615000

Above summary statistics show that houses are sold at a max of $615,000. And 50% of houses are priced between 130k and 180k approximately.

Let’s see their relation ship

ggplot(data=ames_train, aes(x=log(age), y=log(price))) + geom_point() + stat_smooth(method = "lm", se = FALSE)

Taken log of price and age. They are better in finding the good relation due to skewness (as seen in Peer assignment 1). From the plot, as the age increases, the price decreases.

2.1.2 Part 1.2 : Sale Condition vs Price

What kind of houses are sold a lot?

Cookbook data :

Sale Type (Nominal): Type of sale
        
       WD - Warranty Deed - Conventional
       CWD - Warranty Deed - Cash
       VWD - Warranty Deed - VA Loan
       New - Home just constructed and sold
       COD - Court Officer Deed/Estate
       Con - Contract 15% Down payment regular terms
       ConLw - Contract Low Down payment and low interest
       ConLI - Contract Low Interest
       ConLD - Contract Low Down
       Oth - Other
        
Sale Condition (Nominal): Condition of sale

       Normal - Normal Sale
       Abnorml - Abnormal Sale -  trade, foreclosure, short sale
       AdjLand- Adjoining Land Purchase
       Alloca   - Allocation - two linked properties with separate deeds, typically condo with a garage unit    
       Family   - Sale between family members
       Partial- Home was not completed when last assessed (associated with New Homes)
mosaic_table = table(ames_train$Sale.Condition, ames_train$Sale.Type)
mosaicplot(mosaic_table, "Sale Condition vs Sale Type", xlab="Sale Condition", ylab="Sale Type", color = 2:5, dir=c("v","h"), las=2, off=20)

Inferences

  • We can observe that houses with normal conditions are sold in high percentage with warranty deed
  • Even Sale.Condition Abnormal houses are sold in good numbers compared to others (except Normal)
  • Sale between family members Sale.Condition == "Family" are also good with good Sale.Type
  • Adjoining Land Purchase Sale.Condition == "AdjLand" is very small compared to all other sale conditions

Corresponding summary stats

ames_train %>% filter(!is.na(Sale.Condition)) %>% group_by(Sale.Condition) %>% summarise(count=n(), mean=mean(price), median=median(price), IQR=IQR(price), min=min(price), max=max(price), sd=sd(price))
## # A tibble: 6 x 8
##   Sale.Condition count    mean  median    IQR    min    max      sd
##   <fct>          <int>   <dbl>   <dbl>  <dbl>  <dbl>  <dbl>   <dbl>
## 1 Abnorml           61 143740. 132000   55000  12789 475000  76042.
## 2 AdjLand            2 138750  138750   11250 127500 150000  15910.
## 3 Alloca             4 156766. 152556.  11483 142953 179000  15557.
## 4 Family            17 146959. 149000   19000  82500 235000  41490.
## 5 Normal           834 174622. 155500   76000  39300 615000  72270.
## 6 Partial           82 285172. 254146. 161610 115000 611657 107858.
  • Normal and Abnormal houses are sold in relatively near price compared to AdjLand,Family and Alloca
  • Partial has much variance compared to very low variance of Alloca, Family and AdjLand (check the plot below too)
ggplot(data=ames_train, aes(x=factor(Sale.Condition), y=price)) + geom_boxplot()

ames_train %>% ggplot(aes(x=price))+
    geom_histogram(aes(fill=factor(ames_train$Sale.Condition)))+
    labs(title="Price distribution by Sale.Condition", xlab="Price", ylab="House Count", fill="Sale.Condition", color=1:5)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This plot also shows that the house price skewness is influenced partially by Sales not done in Normal condition.

2.1.3 Part 1.3 : Living Area vs Price vs Overall.Qual

Heat Map of the log(area) vs log(price) gives us where the population density is concentrated. It can see that inferenced that many homes are chosen for the area between exp(7.0) (1096) to exp(7.3) (1480) square feet with the price range around exp(12) ($162,754)

ggplot(data=ames_train, aes(x=log(area), y=log(price),color = "red")) + 
  geom_point() + 
  stat_density_2d(aes(fill = ..level..), geom="polygon") + 
  scale_fill_gradient(low="blue", high="red") + 
  stat_smooth(method = "lm", se = FALSE)

Below plot can be used to know the Overall.Qual based distribution. It is to know what kind of quality homes present in what area. Eg: lower quality homes are mostly present in the bottom left while higher quality homes are in the top right

ggplot(data=ames_train, aes(x=log(area), y=log(price), color=factor(Overall.Qual), size=factor(Overall.Qual))) + 
  geom_point(alpha=0.4) + 
  scale_color_manual(values = c("red","green","yellow","purple","blue","pink","orange","black","violet","grey")) 
## Warning: Using size for a discrete variable is not advised.

2.1.4 Part 1.4 : Some other camparisons

2.1.4.1 Part 1.4.1 : area vs price

ames_train %>% ggplot(aes(x=area ,y=log(price)))+ geom_point(color="red")+ geom_smooth(method = "lm", fill="blue")

ames_train %>% ggplot(aes(x=log(area) ,y=log(price)))+ geom_point(color="red")+ geom_smooth(method = "lm", fill="blue")

We'll use log(area) and log(price) for the analysis instead of just area and price to adjust to the right skewness of the price

histogram(ames_train$price)

2.1.4.2 Part 1.4.2 : Bldg.Type vs price

ames_train %>% ggplot(aes(x=Bldg.Type ,y=log(price), fill=Bldg.Type))+ geom_point(color="red")+ geom_boxplot()

Fam and Twnhs has good variability

2.1.4.3 Part 1.4.3 : Sale.Type vs price

ames_train %>% ggplot(aes(x=Sale.Type ,y=log(price), fill=Sale.Type))+ geom_point(color="red")+ geom_boxplot()


2.2 Part 2 - Development and assessment of an initial model, following a semi-guided process of analysis

Based on EDA, we’ll choose the below variables , area, Lot.Area, price, age, Bldg.Type, Neighborhood, Sale.Type, Sale.Condition, Bsmt.Qual and Bedroom.AbvGr for the initial model analysis. Some params like Bsmt.Qual, Neighborhood, Lot.Area are chosen on assumption

Let’s check for NA values first.

NA_count_cols <- c("area", "Lot.Area", "price", "age", "Bldg.Type", "Neighborhood", "Sale.Type", "Sale.Condition", "Bsmt.Qual", "Bedroom.AbvGr")
NA_count <- c(dim(ames_train[which(is.na(ames_train$area)),])[[1]], dim(ames_train[which(is.na(ames_train$Lot.Area)),])[[1]], dim(ames_train[which(is.na(ames_train$price)),])[[1]], dim(ames_train[which(is.na(ames_train$age)),])[[1]], dim(ames_train[which(is.na(ames_train$Bldg.Type)),])[[1]], dim(ames_train[which(is.na(ames_train$Neighborhood)),])[[1]], dim(ames_train[which(is.na(ames_train$Sale.Type)),])[[1]], dim(ames_train[which(is.na(ames_train$Sale.Condition)),])[[1]], dim(ames_train[which(is.na(ames_train$Bsmt.Qual) | ames_train$Bedroom.AbvGr == ""),])[[1]], dim(ames_train[which(is.na(ames_train$Bedroom.AbvGr)),])[[1]])

NA_data = data.frame(var_name = NA_count_cols, NA_count)

NA_data
##          var_name NA_count
## 1            area        0
## 2        Lot.Area        0
## 3           price        0
## 4             age        0
## 5       Bldg.Type        0
## 6    Neighborhood        0
## 7       Sale.Type        0
## 8  Sale.Condition        0
## 9       Bsmt.Qual       21
## 10  Bedroom.AbvGr        0

Only Bsmt.Qual has NA values. As NA in Bsmt.Qual means there is no basement, we’ll replace NA with No and save it in a new column Bsmt.Qual.New

ames_train_bq <- ames_train %>% select(Bsmt.Qual)
ames_train_bq <- sapply(ames_train_bq, as.character)
ames_train_bq[is.na(ames_train_bq)] <- c("No")
ames_train_bq[ames_train_bq==""] <- c("No")
ames_train_bq <- data.frame(ames_train_bq)
colnames(ames_train_bq) <- "Bsmt.Qual.New"
ames_train <- cbind(ames_train, ames_train_bq)
ames_train %>% group_by(Bsmt.Qual.New) %>% summarise(count=n())
## # A tibble: 6 x 2
##   Bsmt.Qual.New count
##   <fct>         <int>
## 1 Ex               87
## 2 Fa               28
## 3 Gd              424
## 4 No               22
## 5 Po                1
## 6 TA              438

All NA's are removed. We can proceed with model.

2.2.1 Section 2.1 An Initial Model


initial_model <- lm(log(price) ~ log(area) + log(Lot.Area) + age + Bldg.Type + Neighborhood + Sale.Type + Sale.Condition + Bsmt.Qual.New + Bedroom.AbvGr, data=ames_train)
summary(initial_model)
## 
## Call:
## lm(formula = log(price) ~ log(area) + log(Lot.Area) + age + Bldg.Type + 
##     Neighborhood + Sale.Type + Sale.Condition + Bsmt.Qual.New + 
##     Bedroom.AbvGr, data = ames_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.82370 -0.07893  0.00673  0.08691  0.65698 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            7.0604157  0.2189550  32.246  < 2e-16 ***
## log(area)              0.6549554  0.0267014  24.529  < 2e-16 ***
## log(Lot.Area)          0.0668661  0.0175700   3.806 0.000150 ***
## age                   -0.0040712  0.0004557  -8.933  < 2e-16 ***
## Bldg.Type2fmCon       -0.0438328  0.0397442  -1.103 0.270363    
## Bldg.TypeDuplex       -0.1240790  0.0324851  -3.820 0.000142 ***
## Bldg.TypeTwnhs        -0.1721567  0.0439861  -3.914 9.73e-05 ***
## Bldg.TypeTwnhsE       -0.0837392  0.0282645  -2.963 0.003126 ** 
## NeighborhoodBlueste   -0.0510795  0.1116689  -0.457 0.647475    
## NeighborhoodBrDale    -0.1189985  0.0825160  -1.442 0.149598    
## NeighborhoodBrkSide   -0.0598267  0.0668764  -0.895 0.371236    
## NeighborhoodClearCr   -0.0411322  0.0756287  -0.544 0.586659    
## NeighborhoodCollgCr   -0.0794834  0.0581222  -1.368 0.171786    
## NeighborhoodCrawfor    0.0810311  0.0658794   1.230 0.219006    
## NeighborhoodEdwards   -0.1731047  0.0605792  -2.857 0.004364 ** 
## NeighborhoodGilbert   -0.1687278  0.0602440  -2.801 0.005202 ** 
## NeighborhoodGreens     0.2473113  0.0985777   2.509 0.012281 *  
## NeighborhoodGrnHill    0.4857328  0.1308637   3.712 0.000218 ***
## NeighborhoodIDOTRR    -0.2965654  0.0681622  -4.351 1.50e-05 ***
## NeighborhoodMeadowV   -0.2631989  0.0694221  -3.791 0.000159 ***
## NeighborhoodMitchel   -0.0956257  0.0613052  -1.560 0.119135    
## NeighborhoodNAmes     -0.0696095  0.0596347  -1.167 0.243398    
## NeighborhoodNoRidge    0.0375115  0.0640297   0.586 0.558120    
## NeighborhoodNPkVill   -0.0017319  0.1016813  -0.017 0.986414    
## NeighborhoodNridgHt    0.1108936  0.0584668   1.897 0.058173 .  
## NeighborhoodNWAmes    -0.1020716  0.0619968  -1.646 0.100014    
## NeighborhoodOldTown   -0.1524135  0.0667131  -2.285 0.022556 *  
## NeighborhoodSawyer    -0.0767783  0.0618349  -1.242 0.214667    
## NeighborhoodSawyerW   -0.1456620  0.0601292  -2.422 0.015602 *  
## NeighborhoodSomerst    0.0016004  0.0555321   0.029 0.977014    
## NeighborhoodStoneBr    0.1669896  0.0647757   2.578 0.010089 *  
## NeighborhoodSWISU     -0.1305219  0.0784628  -1.663 0.096546 .  
## NeighborhoodTimber     0.0045418  0.0679828   0.067 0.946749    
## NeighborhoodVeenker   -0.0299609  0.0773795  -0.387 0.698700    
## Sale.TypeCon           0.2056553  0.0857099   2.399 0.016613 *  
## Sale.TypeConLD         0.0104395  0.0739765   0.141 0.887806    
## Sale.TypeConLI         0.0774655  0.0830963   0.932 0.351451    
## Sale.TypeConLw         0.0811007  0.0768599   1.055 0.291614    
## Sale.TypeCWD           0.2134994  0.1026614   2.080 0.037827 *  
## Sale.TypeNew          -0.0578622  0.1104776  -0.524 0.600578    
## Sale.TypeOth           0.0046133  0.0917232   0.050 0.959897    
## Sale.TypeVWD           0.0409980  0.1708287   0.240 0.810386    
## Sale.TypeWD            0.0823896  0.0343927   2.396 0.016789 *  
## Sale.ConditionAdjLand  0.2147700  0.1268012   1.694 0.090641 .  
## Sale.ConditionAlloca   0.2920297  0.0880430   3.317 0.000945 ***
## Sale.ConditionFamily  -0.0409188  0.0473810  -0.864 0.388020    
## Sale.ConditionNormal   0.1108491  0.0235083   4.715 2.78e-06 ***
## Sale.ConditionPartial  0.2661889  0.1056136   2.520 0.011886 *  
## Bsmt.Qual.NewFa       -0.2413341  0.0473529  -5.096 4.18e-07 ***
## Bsmt.Qual.NewGd       -0.1488620  0.0238543  -6.240 6.57e-10 ***
## Bsmt.Qual.NewNo       -0.4501281  0.0478818  -9.401  < 2e-16 ***
## Bsmt.Qual.NewPo       -0.2439410  0.1720251  -1.418 0.156504    
## Bsmt.Qual.NewTA       -0.1818053  0.0307897  -5.905 4.92e-09 ***
## Bedroom.AbvGr         -0.0497919  0.0096609  -5.154 3.10e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1659 on 946 degrees of freedom
## Multiple R-squared:  0.8527, Adjusted R-squared:  0.8444 
## F-statistic: 103.3 on 53 and 946 DF,  p-value: < 2.2e-16

2.2.1.1 Section 2.1.1 Model Variable Explanation/Discussion

  • log(area) - Main factor for the price determination with the co-eff 0.6549554
  • log(Lot.Area) - less important factor that log(area) with the co-eff 0.0668661
  • Neighborhoods like ‘Crawfor’, ‘Greens’, ‘Grnhill’, ‘NoRidge’, ‘NridgHt’, ‘Somerst’, ‘StoneBr’ and ‘Timber’ all has positive co-eff which will increase the price, thus making the places a bit expensive
  • Other than Sale.Type new, all others increases the housing price
  • BedRoom.AbvGr - Seems a good factor with the co-eff -0.0497919
  • Bsmt.Qual.New - decreases the price
  • Sale.Condition increases the price other than the value Family
  • Age - decreases the price with the co-eff 0.0040712
  • Adjusted R-squared value is 0.8444 which is a good indicator for the initial model, with value almost near to 1

2.2.2 Section 2.2 Model Selection

Now either using BAS another stepwise selection procedure choose the “best” model you can, using your initial model as your starting point. Try at least two different model selection methods and compare their results. Do they both arrive at the same model or do they disagree? What do you think this means?


2.2.2.1 Section 2.2.1 Model selection using BIC with a uniform prior

initial_model_bic <- bas.lm(log(price) ~ log(area) + log(Lot.Area) + age + Bldg.Type + Sale.Type + Sale.Condition + Bsmt.Qual.New + Bedroom.AbvGr + Neighborhood,data = ames_train, prior = "BIC", modelprior = uniform())
## Warning in model == got.parents: longer object length is not a multiple of
## shorter object length
## Warning in bas.lm(log(price) ~ log(area) + log(Lot.Area) + age + Bldg.Type
## + : bestmodel violates heredity conditions; resetting to null model
summary(initial_model_bic)
##                       P(B != 0 | Y)    model 1       model 2       model 3
## Intercept              1.000000e+00     1.0000     1.0000000  1.000000e+00
## log(area)              1.000000e+00     1.0000     1.0000000  1.000000e+00
## log(Lot.Area)          9.875247e-01     1.0000     1.0000000  0.000000e+00
## age                    1.000000e+00     1.0000     1.0000000  1.000000e+00
## Bldg.Type2fmCon        9.043871e-01     1.0000     0.0000000  1.000000e+00
## Bldg.TypeDuplex        9.043871e-01     1.0000     0.0000000  1.000000e+00
## Bldg.TypeTwnhs         9.043871e-01     1.0000     0.0000000  1.000000e+00
## Bldg.TypeTwnhsE        9.043871e-01     1.0000     0.0000000  1.000000e+00
## Sale.TypeCon           6.698193e-11     0.0000     0.0000000  0.000000e+00
## Sale.TypeConLD         6.698193e-11     0.0000     0.0000000  0.000000e+00
## Sale.TypeConLI         6.698193e-11     0.0000     0.0000000  0.000000e+00
## Sale.TypeConLw         6.698193e-11     0.0000     0.0000000  0.000000e+00
## Sale.TypeCWD           6.698193e-11     0.0000     0.0000000  0.000000e+00
## Sale.TypeNew           6.698193e-11     0.0000     0.0000000  0.000000e+00
## Sale.TypeOth           6.698193e-11     0.0000     0.0000000  0.000000e+00
## Sale.TypeVWD           6.698193e-11     0.0000     0.0000000  0.000000e+00
## Sale.TypeWD            6.698193e-11     0.0000     0.0000000  0.000000e+00
## Sale.ConditionAdjLand  9.997803e-01     1.0000     1.0000000  1.000000e+00
## Sale.ConditionAlloca   9.997803e-01     1.0000     1.0000000  1.000000e+00
## Sale.ConditionFamily   9.997803e-01     1.0000     1.0000000  1.000000e+00
## Sale.ConditionNormal   9.997803e-01     1.0000     1.0000000  1.000000e+00
## Sale.ConditionPartial  9.997803e-01     1.0000     1.0000000  1.000000e+00
## Bsmt.Qual.NewFa        1.000000e+00     1.0000     1.0000000  1.000000e+00
## Bsmt.Qual.NewGd        1.000000e+00     1.0000     1.0000000  1.000000e+00
## Bsmt.Qual.NewNo        1.000000e+00     1.0000     1.0000000  1.000000e+00
## Bsmt.Qual.NewPo        1.000000e+00     1.0000     1.0000000  1.000000e+00
## Bsmt.Qual.NewTA        1.000000e+00     1.0000     1.0000000  1.000000e+00
## Bedroom.AbvGr          9.999793e-01     1.0000     1.0000000  1.000000e+00
## NeighborhoodBlueste    1.000000e+00     1.0000     1.0000000  1.000000e+00
## NeighborhoodBrDale     1.000000e+00     1.0000     1.0000000  1.000000e+00
## NeighborhoodBrkSide    1.000000e+00     1.0000     1.0000000  1.000000e+00
## NeighborhoodClearCr    1.000000e+00     1.0000     1.0000000  1.000000e+00
## NeighborhoodCollgCr    1.000000e+00     1.0000     1.0000000  1.000000e+00
## NeighborhoodCrawfor    1.000000e+00     1.0000     1.0000000  1.000000e+00
## NeighborhoodEdwards    1.000000e+00     1.0000     1.0000000  1.000000e+00
## NeighborhoodGilbert    1.000000e+00     1.0000     1.0000000  1.000000e+00
## NeighborhoodGreens     1.000000e+00     1.0000     1.0000000  1.000000e+00
## NeighborhoodGrnHill    1.000000e+00     1.0000     1.0000000  1.000000e+00
## NeighborhoodIDOTRR     1.000000e+00     1.0000     1.0000000  1.000000e+00
## NeighborhoodMeadowV    1.000000e+00     1.0000     1.0000000  1.000000e+00
## NeighborhoodMitchel    1.000000e+00     1.0000     1.0000000  1.000000e+00
## NeighborhoodNAmes      1.000000e+00     1.0000     1.0000000  1.000000e+00
## NeighborhoodNoRidge    1.000000e+00     1.0000     1.0000000  1.000000e+00
## NeighborhoodNPkVill    1.000000e+00     1.0000     1.0000000  1.000000e+00
## NeighborhoodNridgHt    1.000000e+00     1.0000     1.0000000  1.000000e+00
## NeighborhoodNWAmes     1.000000e+00     1.0000     1.0000000  1.000000e+00
## NeighborhoodOldTown    1.000000e+00     1.0000     1.0000000  1.000000e+00
## NeighborhoodSawyer     1.000000e+00     1.0000     1.0000000  1.000000e+00
## NeighborhoodSawyerW    1.000000e+00     1.0000     1.0000000  1.000000e+00
## NeighborhoodSomerst    1.000000e+00     1.0000     1.0000000  1.000000e+00
## NeighborhoodStoneBr    1.000000e+00     1.0000     1.0000000  1.000000e+00
## NeighborhoodSWISU      1.000000e+00     1.0000     1.0000000  1.000000e+00
## NeighborhoodTimber     1.000000e+00     1.0000     1.0000000  1.000000e+00
## NeighborhoodVeenker    1.000000e+00     1.0000     1.0000000  1.000000e+00
## BF                               NA     1.0000     0.1072144  1.397977e-02
## PostProbs                        NA     0.8917     0.0956000  1.250000e-02
## R2                               NA     0.8504     0.8455000  8.481000e-01
## dim                              NA    45.0000    41.0000000  4.400000e+01
## logmarg                          NA -1792.7887 -1795.0215757 -1.797059e+03
##                             model 4       model 5
## Intercept              1.000000e+00  1.000000e+00
## log(area)              1.000000e+00  1.000000e+00
## log(Lot.Area)          1.000000e+00  1.000000e+00
## age                    1.000000e+00  1.000000e+00
## Bldg.Type2fmCon        1.000000e+00  1.000000e+00
## Bldg.TypeDuplex        1.000000e+00  1.000000e+00
## Bldg.TypeTwnhs         1.000000e+00  1.000000e+00
## Bldg.TypeTwnhsE        1.000000e+00  1.000000e+00
## Sale.TypeCon           0.000000e+00  0.000000e+00
## Sale.TypeConLD         0.000000e+00  0.000000e+00
## Sale.TypeConLI         0.000000e+00  0.000000e+00
## Sale.TypeConLw         0.000000e+00  0.000000e+00
## Sale.TypeCWD           0.000000e+00  0.000000e+00
## Sale.TypeNew           0.000000e+00  0.000000e+00
## Sale.TypeOth           0.000000e+00  0.000000e+00
## Sale.TypeVWD           0.000000e+00  0.000000e+00
## Sale.TypeWD            0.000000e+00  0.000000e+00
## Sale.ConditionAdjLand  0.000000e+00  1.000000e+00
## Sale.ConditionAlloca   0.000000e+00  1.000000e+00
## Sale.ConditionFamily   0.000000e+00  1.000000e+00
## Sale.ConditionNormal   0.000000e+00  1.000000e+00
## Sale.ConditionPartial  0.000000e+00  1.000000e+00
## Bsmt.Qual.NewFa        1.000000e+00  1.000000e+00
## Bsmt.Qual.NewGd        1.000000e+00  1.000000e+00
## Bsmt.Qual.NewNo        1.000000e+00  1.000000e+00
## Bsmt.Qual.NewPo        1.000000e+00  1.000000e+00
## Bsmt.Qual.NewTA        1.000000e+00  1.000000e+00
## Bedroom.AbvGr          1.000000e+00  0.000000e+00
## NeighborhoodBlueste    1.000000e+00  1.000000e+00
## NeighborhoodBrDale     1.000000e+00  1.000000e+00
## NeighborhoodBrkSide    1.000000e+00  1.000000e+00
## NeighborhoodClearCr    1.000000e+00  1.000000e+00
## NeighborhoodCollgCr    1.000000e+00  1.000000e+00
## NeighborhoodCrawfor    1.000000e+00  1.000000e+00
## NeighborhoodEdwards    1.000000e+00  1.000000e+00
## NeighborhoodGilbert    1.000000e+00  1.000000e+00
## NeighborhoodGreens     1.000000e+00  1.000000e+00
## NeighborhoodGrnHill    1.000000e+00  1.000000e+00
## NeighborhoodIDOTRR     1.000000e+00  1.000000e+00
## NeighborhoodMeadowV    1.000000e+00  1.000000e+00
## NeighborhoodMitchel    1.000000e+00  1.000000e+00
## NeighborhoodNAmes      1.000000e+00  1.000000e+00
## NeighborhoodNoRidge    1.000000e+00  1.000000e+00
## NeighborhoodNPkVill    1.000000e+00  1.000000e+00
## NeighborhoodNridgHt    1.000000e+00  1.000000e+00
## NeighborhoodNWAmes     1.000000e+00  1.000000e+00
## NeighborhoodOldTown    1.000000e+00  1.000000e+00
## NeighborhoodSawyer     1.000000e+00  1.000000e+00
## NeighborhoodSawyerW    1.000000e+00  1.000000e+00
## NeighborhoodSomerst    1.000000e+00  1.000000e+00
## NeighborhoodStoneBr    1.000000e+00  1.000000e+00
## NeighborhoodSWISU      1.000000e+00  1.000000e+00
## NeighborhoodTimber     1.000000e+00  1.000000e+00
## NeighborhoodVeenker    1.000000e+00  1.000000e+00
## BF                     2.247309e-04  2.201862e-05
## PostProbs              2.000000e-04  0.000000e+00
## R2                     8.425000e-01  8.461000e-01
## dim                    4.000000e+01  4.400000e+01
## logmarg               -1.801189e+03 -1.803512e+03
image(initial_model_bic, rotate = FALSE)

As per the summary(initial_model_bic), we can notice that inclusion probability of Sale.Type variable is 0. So we can remove that from our model.

Residuals vs Fitted model : Our model holds good as there is no much influence due to the outliers

plot(initial_model_bic, which=1)

Inclusion Probabilities Model : Indicating that Sale.Type can be removed from the model

plot(initial_model_bic, which=4)

2.2.2.2 Section 2.2.1 Model selection using AIC with a uniform prior

initial_model_aic <- bas.lm(log(price) ~ log(area) + log(Lot.Area) + age + Bldg.Type + Sale.Type + Sale.Condition + Bsmt.Qual.New + Bedroom.AbvGr + Neighborhood,data = ames_train, prior = "AIC", modelprior = uniform())
## Warning in model == got.parents: longer object length is not a multiple of
## shorter object length
## Warning in bas.lm(log(price) ~ log(area) + log(Lot.Area) + age + Bldg.Type
## + : bestmodel violates heredity conditions; resetting to null model
summary(initial_model_aic)
##                       P(B != 0 | Y)    model 1       model 2       model 3
## Intercept                 1.0000000     1.0000     1.0000000  1.000000e+00
## log(area)                 1.0000000     1.0000     1.0000000  1.000000e+00
## log(Lot.Area)             0.9987649     1.0000     1.0000000  0.000000e+00
## age                       1.0000000     1.0000     1.0000000  1.000000e+00
## Bldg.Type2fmCon           0.9999948     1.0000     1.0000000  1.000000e+00
## Bldg.TypeDuplex           0.9999948     1.0000     1.0000000  1.000000e+00
## Bldg.TypeTwnhs            0.9999948     1.0000     1.0000000  1.000000e+00
## Bldg.TypeTwnhsE           0.9999948     1.0000     1.0000000  1.000000e+00
## Sale.TypeCon              0.2145579     0.0000     1.0000000  0.000000e+00
## Sale.TypeConLD            0.2145579     0.0000     1.0000000  0.000000e+00
## Sale.TypeConLI            0.2145579     0.0000     1.0000000  0.000000e+00
## Sale.TypeConLw            0.2145579     0.0000     1.0000000  0.000000e+00
## Sale.TypeCWD              0.2145579     0.0000     1.0000000  0.000000e+00
## Sale.TypeNew              0.2145579     0.0000     1.0000000  0.000000e+00
## Sale.TypeOth              0.2145579     0.0000     1.0000000  0.000000e+00
## Sale.TypeVWD              0.2145579     0.0000     1.0000000  0.000000e+00
## Sale.TypeWD               0.2145579     0.0000     1.0000000  0.000000e+00
## Sale.ConditionAdjLand     1.0000000     1.0000     1.0000000  1.000000e+00
## Sale.ConditionAlloca      1.0000000     1.0000     1.0000000  1.000000e+00
## Sale.ConditionFamily      1.0000000     1.0000     1.0000000  1.000000e+00
## Sale.ConditionNormal      1.0000000     1.0000     1.0000000  1.000000e+00
## Sale.ConditionPartial     1.0000000     1.0000     1.0000000  1.000000e+00
## Bsmt.Qual.NewFa           1.0000000     1.0000     1.0000000  1.000000e+00
## Bsmt.Qual.NewGd           1.0000000     1.0000     1.0000000  1.000000e+00
## Bsmt.Qual.NewNo           1.0000000     1.0000     1.0000000  1.000000e+00
## Bsmt.Qual.NewPo           1.0000000     1.0000     1.0000000  1.000000e+00
## Bsmt.Qual.NewTA           1.0000000     1.0000     1.0000000  1.000000e+00
## Bedroom.AbvGr             0.9999979     1.0000     1.0000000  1.000000e+00
## NeighborhoodBlueste       1.0000000     1.0000     1.0000000  1.000000e+00
## NeighborhoodBrDale        1.0000000     1.0000     1.0000000  1.000000e+00
## NeighborhoodBrkSide       1.0000000     1.0000     1.0000000  1.000000e+00
## NeighborhoodClearCr       1.0000000     1.0000     1.0000000  1.000000e+00
## NeighborhoodCollgCr       1.0000000     1.0000     1.0000000  1.000000e+00
## NeighborhoodCrawfor       1.0000000     1.0000     1.0000000  1.000000e+00
## NeighborhoodEdwards       1.0000000     1.0000     1.0000000  1.000000e+00
## NeighborhoodGilbert       1.0000000     1.0000     1.0000000  1.000000e+00
## NeighborhoodGreens        1.0000000     1.0000     1.0000000  1.000000e+00
## NeighborhoodGrnHill       1.0000000     1.0000     1.0000000  1.000000e+00
## NeighborhoodIDOTRR        1.0000000     1.0000     1.0000000  1.000000e+00
## NeighborhoodMeadowV       1.0000000     1.0000     1.0000000  1.000000e+00
## NeighborhoodMitchel       1.0000000     1.0000     1.0000000  1.000000e+00
## NeighborhoodNAmes         1.0000000     1.0000     1.0000000  1.000000e+00
## NeighborhoodNoRidge       1.0000000     1.0000     1.0000000  1.000000e+00
## NeighborhoodNPkVill       1.0000000     1.0000     1.0000000  1.000000e+00
## NeighborhoodNridgHt       1.0000000     1.0000     1.0000000  1.000000e+00
## NeighborhoodNWAmes        1.0000000     1.0000     1.0000000  1.000000e+00
## NeighborhoodOldTown       1.0000000     1.0000     1.0000000  1.000000e+00
## NeighborhoodSawyer        1.0000000     1.0000     1.0000000  1.000000e+00
## NeighborhoodSawyerW       1.0000000     1.0000     1.0000000  1.000000e+00
## NeighborhoodSomerst       1.0000000     1.0000     1.0000000  1.000000e+00
## NeighborhoodStoneBr       1.0000000     1.0000     1.0000000  1.000000e+00
## NeighborhoodSWISU         1.0000000     1.0000     1.0000000  1.000000e+00
## NeighborhoodTimber        1.0000000     1.0000     1.0000000  1.000000e+00
## NeighborhoodVeenker       1.0000000     1.0000     1.0000000  1.000000e+00
## BF                               NA     1.0000     0.2731246  1.201696e-03
## PostProbs                        NA     0.7845     0.2143000  9.000000e-04
## R2                               NA     0.8504     0.8527000  8.481000e-01
## dim                              NA    45.0000    54.0000000  4.400000e+01
## logmarg                          NA -1682.3642 -1683.6619841 -1.689088e+03
##                             model 4       model 5
## Intercept              1.000000e+00  1.000000e+00
## log(area)              1.000000e+00  1.000000e+00
## log(Lot.Area)          0.000000e+00  1.000000e+00
## age                    1.000000e+00  1.000000e+00
## Bldg.Type2fmCon        1.000000e+00  0.000000e+00
## Bldg.TypeDuplex        1.000000e+00  0.000000e+00
## Bldg.TypeTwnhs         1.000000e+00  0.000000e+00
## Bldg.TypeTwnhsE        1.000000e+00  0.000000e+00
## Sale.TypeCon           1.000000e+00  0.000000e+00
## Sale.TypeConLD         1.000000e+00  0.000000e+00
## Sale.TypeConLI         1.000000e+00  0.000000e+00
## Sale.TypeConLw         1.000000e+00  0.000000e+00
## Sale.TypeCWD           1.000000e+00  0.000000e+00
## Sale.TypeNew           1.000000e+00  0.000000e+00
## Sale.TypeOth           1.000000e+00  0.000000e+00
## Sale.TypeVWD           1.000000e+00  0.000000e+00
## Sale.TypeWD            1.000000e+00  0.000000e+00
## Sale.ConditionAdjLand  1.000000e+00  1.000000e+00
## Sale.ConditionAlloca   1.000000e+00  1.000000e+00
## Sale.ConditionFamily   1.000000e+00  1.000000e+00
## Sale.ConditionNormal   1.000000e+00  1.000000e+00
## Sale.ConditionPartial  1.000000e+00  1.000000e+00
## Bsmt.Qual.NewFa        1.000000e+00  1.000000e+00
## Bsmt.Qual.NewGd        1.000000e+00  1.000000e+00
## Bsmt.Qual.NewNo        1.000000e+00  1.000000e+00
## Bsmt.Qual.NewPo        1.000000e+00  1.000000e+00
## Bsmt.Qual.NewTA        1.000000e+00  1.000000e+00
## Bedroom.AbvGr          1.000000e+00  1.000000e+00
## NeighborhoodBlueste    1.000000e+00  1.000000e+00
## NeighborhoodBrDale     1.000000e+00  1.000000e+00
## NeighborhoodBrkSide    1.000000e+00  1.000000e+00
## NeighborhoodClearCr    1.000000e+00  1.000000e+00
## NeighborhoodCollgCr    1.000000e+00  1.000000e+00
## NeighborhoodCrawfor    1.000000e+00  1.000000e+00
## NeighborhoodEdwards    1.000000e+00  1.000000e+00
## NeighborhoodGilbert    1.000000e+00  1.000000e+00
## NeighborhoodGreens     1.000000e+00  1.000000e+00
## NeighborhoodGrnHill    1.000000e+00  1.000000e+00
## NeighborhoodIDOTRR     1.000000e+00  1.000000e+00
## NeighborhoodMeadowV    1.000000e+00  1.000000e+00
## NeighborhoodMitchel    1.000000e+00  1.000000e+00
## NeighborhoodNAmes      1.000000e+00  1.000000e+00
## NeighborhoodNoRidge    1.000000e+00  1.000000e+00
## NeighborhoodNPkVill    1.000000e+00  1.000000e+00
## NeighborhoodNridgHt    1.000000e+00  1.000000e+00
## NeighborhoodNWAmes     1.000000e+00  1.000000e+00
## NeighborhoodOldTown    1.000000e+00  1.000000e+00
## NeighborhoodSawyer     1.000000e+00  1.000000e+00
## NeighborhoodSawyerW    1.000000e+00  1.000000e+00
## NeighborhoodSomerst    1.000000e+00  1.000000e+00
## NeighborhoodStoneBr    1.000000e+00  1.000000e+00
## NeighborhoodSWISU      1.000000e+00  1.000000e+00
## NeighborhoodTimber     1.000000e+00  1.000000e+00
## NeighborhoodVeenker    1.000000e+00  1.000000e+00
## BF                     3.726636e-04  5.853706e-06
## PostProbs              3.000000e-04  0.000000e+00
## R2                     8.504000e-01  8.455000e-01
## dim                    5.300000e+01  4.100000e+01
## logmarg               -1.690259e+03 -1.694413e+03
image(initial_model_aic, rotate = FALSE)

As per the summary(initial_model_aic), we can notice that inclusion probability of Sale.Type variable is 0 in Model 1. So we can remove that from our model.

Residuals vs Fitted model : Our model holds good as there is no much influence due to the outliers

plot(initial_model_aic, which=1)

Inclusion Probabilities Model : Indicating that Sale.Type can be removed from the model because of the lower inclusion probability 0.2145579

plot(initial_model_aic, which=4)


2.2.2.3 Section 2.2.3 : Comparison of BIC and AIC

Both the model’s output are almost same (with same R2) and the result is to remove the variable Sale.Type. But which is better? As we know that AIC will give penalty for the parameters far less than that of BIC for which the penalty is the number of observations. BIC is parsimonious.

From the above 2 summaries the inclusion probability of the variables

variable name (with model) inclusion probability
log(Lot.Area) - BIC 9.875247e-01
log(Lot.Area) - AIC 0.9987649
Bldg.Type - BIC 6.698193e-11
Bldg.Type - AIC 0.2145579

We can see that BIC has put more penalty for the variable Bldg.Type. So we’ll follow BIC model

So our new preferred model is

initial_preferred_model <- lm(log(price) ~ log(area) + log(Lot.Area) + age + Bldg.Type + Neighborhood + Sale.Condition + Bsmt.Qual.New + Bedroom.AbvGr, data=ames_train)

2.2.3 Section 2.3 Initial Model Residuals

One way to assess the performance of a model is to examine the model’s residuals. In the space below, create a residual plot for your preferred model from above and use it to assess whether your model appears to fit the data well. Comment on any interesting structure in the residual plot (trend, outliers, etc.) and briefly discuss potential implications it may have for your model and inference / prediction you might produce.


Distribution is normal around 0

hist(initial_preferred_model$residuals, breaks=100)

Residuals are scattered around 0

plot(initial_preferred_model$residuals, main="Residual distribution of the latest model from BIC")

2.2.3.1 Outliers

Sorting based on the residuals, we’ll get the outliers as

ames_train[names(head(sort(initial_preferred_model$residuals), 10)),c("price","age", "area", "Lot.Area", "Bldg.Type", "Neighborhood", "Sale.Condition", "Bsmt.Qual.New", "Bedroom.AbvGr")]
##      price age area Lot.Area Bldg.Type Neighborhood Sale.Condition
## 428  12789  95  832     9656      1Fam      OldTown        Abnorml
## 310 184750  11 4676    40094      1Fam      Edwards        Partial
## 741  40000  98 1317     8500      1Fam       IDOTRR         Normal
## 268  55000  98 1086    11340      1Fam      OldTown         Normal
## 206  50000 108 1484    10320      1Fam       IDOTRR        Abnorml
## 276 150000  41 2944    24572      1Fam      Veenker         Family
## 559  34900  98  720     7879      1Fam       IDOTRR        Abnorml
## 181  82500  41 1411    11900      1Fam       NWAmes         Family
## 511  63000  84 1112     9780      1Fam      Edwards         Normal
## 645  35311  69  480     9000      1Fam       IDOTRR        Abnorml
##     Bsmt.Qual.New Bedroom.AbvGr
## 428            Fa             2
## 310            Ex             3
## 741            TA             3
## 268            Fa             2
## 206            TA             3
## 276            Gd             3
## 559            TA             2
## 181            TA             3
## 511            TA             4
## 645            TA             1
  • 6 out of 10 houses are more than the age 80
  • All Bldg.Type values are 1Fam1
  • 7 out of 10 houses are sold not in the condition Normal
  • 5 out of 10 houses has the basement quality TA

2.2.4 Section 2.4 Initial Model RMSE

You can calculate it directly based on the model output. Be specific about the units of your RMSE (depending on whether you transformed your response variable). The value you report will be more meaningful if it is in the original units (dollars).


Since the price is in log, we’ll use exp to get the original dollar values. And we use predict function for in-sample prediction. So the RMSE will be in dollars

initial_predicted_values <- exp(predict(initial_preferred_model))
initial_residues <- ames_train$price - initial_predicted_values
initial_rmse <- sqrt(mean(initial_residues^2))
initial_rmse
## [1] 31033.15

2.2.5 Section 2.5 Overfitting

The process of building a model generally involves starting with an initial model (as you have done above), identifying its shortcomings, and adapting the model accordingly. This process may be repeated several times until the model fits the data reasonably well. However, the model may do well on training data but perform poorly out-of-sample (meaning, on a dataset other than the original training data) because the model is overly-tuned to specifically fit the training data. This is called ???overfitting.??? To determine whether overfitting is occurring on a model, compare the performance of a model on both in-sample and out-of-sample data sets. To look at performance of your initial model on out-of-sample data, you will use the data set ames_test.

load("ames_test.Rdata")

Use your model from above to generate predictions for the housing prices in the test data set. Are the predictions significantly more accurate (compared to the actual sales prices) for the training data than the test data? Why or why not? Briefly explain how you determined that (what steps or processes did you use)?


Checking for NA’s and replacing them with ‘No’

ames_test <- ames_test %>% mutate(age=2018-Year.Built)

c(dim(ames_test[which(is.na(ames_test$area)),])[[1]], dim(ames_test[which(is.na(ames_test$Lot.Area)),])[[1]], dim(ames_test[which(is.na(ames_test$price)),])[[1]], dim(ames_test[which(is.na(ames_test$age)),])[[1]], dim(ames_test[which(is.na(ames_test$Bldg.Type)),])[[1]], dim(ames_test[which(is.na(ames_test$Neighborhood)),])[[1]], dim(ames_test[which(is.na(ames_test$Sale.Type)),])[[1]], dim(ames_test[which(is.na(ames_test$Sale.Condition)),])[[1]], dim(ames_test[which(is.na(ames_test$Bsmt.Qual)),])[[1]]+dim(ames_test[which(ames_test$Bedroom.Qual == ""),])[[1]], dim(ames_test[which(is.na(ames_train$Bedroom.AbvGr)),])[[1]])
## Warning: Unknown or uninitialised column: 'Bedroom.Qual'.
##  [1]  0  0  0  0  0  0  0  0 22  0
ames_test_bq <- ames_test %>% select(Bsmt.Qual)
ames_test_bq <- sapply(ames_test_bq, as.character)
ames_test_bq[is.na(ames_test_bq)] <- c("No")
ames_test_bq[ames_test_bq==""] <- c("No")
ames_test_bq <- data.frame(ames_test_bq)
colnames(ames_test_bq) <- "Bsmt.Qual.New"
ames_test <- cbind(ames_test, ames_test_bq)
model_var_level <- names(initial_preferred_model$xlevels)
initial_preferred_model$xlevels <- sapply(model_var_level, function(x) union(initial_preferred_model$xlevels[[x]], levels(ames_test[[x]])))
predict_test_data <- exp(predict(initial_preferred_model, newdata=ames_test))
## Warning in predict.lm(initial_preferred_model, newdata = ames_test):
## prediction from a rank-deficient fit may be misleading
initial_test_residue <- ames_test$price - predict_test_data
initial_test_rmse <- sqrt(mean(initial_test_residue^2, na.rm = TRUE))
initial_test_rmse
## [1] 48191.67

The RMSE difference between 48191.67 (for testing data) and 31033.15 (for training data) is very huge. This is due the potential outliers and the model has to be tuned to reduce the difference in the upcoming stages.

Below is the residual comparison of the 2 data sets. We’ll have a variable ‘type’ in the new data frame initial_residual_comparison which will indicate from where the data came from (whether it is training set or testing set). Value 1 is for training set and 2 for the testing set.

residual_comp1 <- data.frame(index<- seq(from=1, to=dim(ames_train)[1], by=1), residue <- initial_residues, type <- rep(1,dim(ames_train)[1]))
residual_comp2 <- data.frame(index<- seq(from=1, to=dim(ames_test)[1], by=1), residue <- initial_test_residue, type <- rep(2,dim(ames_test)[1]))
colnames(residual_comp1) <- c("index", "residue", "type")
colnames(residual_comp2) <- c("index", "residue", "type")
initial_residual_comparison <- rbind(residual_comp1, residual_comp2)
ggplot(initial_residual_comparison, aes(x=index, y=residue)) + geom_point(aes(color=type, alpha=0.5))

We can observe that the testing set’s residue are more dispersed than that of training set.


Note to the learner: If in real-life practice this out-of-sample analysis shows evidence that the training data fits your model a lot better than the test data, it is probably a good idea to go back and revise the model (usually by simplifying the model) to reduce this overfitting. For simplicity, we do not ask you to do this on the assignment, however.

2.3 Part 3 Development of a Final Model

Now that you have developed an initial model to use as a baseline, create a final model with at most 20 variables to predict housing prices in Ames, IA, selecting from the full array of variables in the dataset and using any of the tools that we introduced in this specialization.

Carefully document the process that you used to come up with your final model, so that you can answer the questions below.

#Function to replace NAs
ames_transform <- function(data){
    var_to_use <- data %>% summarise_all(funs(length(unique(.))))
    var_to_use <- names(data.frame(var_to_use)[,var_to_use>1])

    data <- data %>% select(var_to_use)

    data_types <- split(names(data), sapply(data, function(x) paste(class(x), collapse = " ")))
    data_int <- data %>% select(data_types$integer)

    ames_train_fac <- data %>% select(data_types$factor)
    ames_train_fac <- sapply(ames_train_fac, as.character)
    ames_train_fac[is.na(ames_train_fac)] <- c("None")
    ames_train_fac[ames_train_fac==""] <- c("None")
    ames_train_fac <- data.frame(ames_train_fac)

    data <- cbind(data_int, ames_train_fac)
    data <- data %>% mutate(age= 2018 - Year.Built)

    data$Year.Built <- NULL

    return(data) 
}

2.3.1 Section 3.1 Final Model

Provide the summary table for your model.


ames_train <- ames_transform(ames_train)

final_lm <- lm(formula = log(price)~., data = na.omit(ames_train))
n <- nrow(na.omit(ames_train))

final_BIC <- stepAIC(final_lm, direction = "backward", k=n, trace = 0, steps = 1000)
summary(final_BIC)
## 
## Call:
## lm(formula = log(price) ~ Overall.Qual, data = na.omit(ames_train))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.53757 -0.11989  0.01186  0.13492  0.94285 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  10.494269   0.037090  282.94   <2e-16 ***
## Overall.Qual  0.249820   0.005859   42.64   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2338 on 779 degrees of freedom
## Multiple R-squared:  0.7001, Adjusted R-squared:  0.6997 
## F-statistic:  1818 on 1 and 779 DF,  p-value: < 2.2e-16
final_AIC <- stepAIC(final_lm, direction = "backward", k=2, trace = 0, steps = 1000)
summary(final_AIC)
## 
## Call:
## lm(formula = log(price) ~ area + Lot.Frontage + Overall.Qual + 
##     Overall.Cond + Year.Remod.Add + BsmtFin.SF.1 + BsmtFin.SF.2 + 
##     X1st.Flr.SF + Bsmt.Full.Bath + Kitchen.AbvGr + Fireplaces + 
##     Garage.Cars + Enclosed.Porch + X3Ssn.Porch + Screen.Porch + 
##     MS.Zoning + Lot.Shape + Lot.Config + Land.Slope + Neighborhood + 
##     Condition.1 + Condition.2 + Bldg.Type + Exterior.2nd + Exter.Qual + 
##     Exter.Cond + Foundation + Bsmt.Qual + Bsmt.Exposure + Heating + 
##     Central.Air + Kitchen.Qual + Functional + Garage.Type + Garage.Qual + 
##     Garage.Cond + Sale.Type + Sale.Condition + age, data = na.omit(ames_train))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.81141 -0.05332  0.00198  0.04965  0.66353 
## 
## Coefficients: (1 not defined because of singularities)
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            7.932e+00  7.275e-01  10.903  < 2e-16 ***
## area                   2.903e-04  1.553e-05  18.688  < 2e-16 ***
## Lot.Frontage          -5.980e-04  2.732e-04  -2.189 0.028980 *  
## Overall.Qual           4.060e-02  6.312e-03   6.431 2.47e-10 ***
## Overall.Cond           4.451e-02  5.639e-03   7.892 1.28e-14 ***
## Year.Remod.Add         8.696e-04  3.512e-04   2.476 0.013529 *  
## BsmtFin.SF.1           8.113e-05  1.497e-05   5.420 8.45e-08 ***
## BsmtFin.SF.2           4.458e-05  2.926e-05   1.523 0.128129    
## X1st.Flr.SF            7.318e-05  1.895e-05   3.863 0.000124 ***
## Bsmt.Full.Bath         2.586e-02  1.166e-02   2.217 0.026979 *  
## Kitchen.AbvGr          1.304e-01  4.717e-02   2.765 0.005860 ** 
## Fireplaces             1.656e-02  8.909e-03   1.859 0.063511 .  
## Garage.Cars            3.261e-02  1.036e-02   3.147 0.001728 ** 
## Enclosed.Porch         1.414e-04  8.342e-05   1.694 0.090656 .  
## X3Ssn.Porch            2.846e-04  1.285e-04   2.214 0.027152 *  
## Screen.Porch           1.704e-04  7.717e-05   2.209 0.027553 *  
## MS.ZoningFV            3.616e-01  6.987e-02   5.175 3.05e-07 ***
## MS.ZoningRH            4.993e-01  7.965e-02   6.269 6.68e-10 ***
## MS.ZoningRL            4.566e-01  6.044e-02   7.553 1.46e-13 ***
## MS.ZoningRM            3.617e-01  5.548e-02   6.520 1.42e-10 ***
## Lot.ShapeIR2           2.876e-02  2.793e-02   1.030 0.303529    
## Lot.ShapeIR3           3.369e-01  7.783e-02   4.329 1.73e-05 ***
## Lot.ShapeReg          -3.381e-03  1.068e-02  -0.317 0.751656    
## Lot.ConfigCulDSac     -3.945e-02  2.542e-02  -1.552 0.121121    
## Lot.ConfigFR2         -5.461e-02  2.646e-02  -2.064 0.039412 *  
## Lot.ConfigFR3         -1.396e-01  6.661e-02  -2.095 0.036536 *  
## Lot.ConfigInside      -3.436e-02  1.183e-02  -2.904 0.003815 ** 
## Land.SlopeMod         -1.417e-02  2.464e-02  -0.575 0.565489    
## Land.SlopeSev         -2.065e-01  9.578e-02  -2.156 0.031482 *  
## NeighborhoodBlueste    6.961e-02  8.468e-02   0.822 0.411377    
## NeighborhoodBrDale     7.097e-02  7.033e-02   1.009 0.313254    
## NeighborhoodBrkSide    1.326e-01  6.125e-02   2.165 0.030721 *  
## NeighborhoodClearCr    1.007e-01  7.317e-02   1.377 0.169015    
## NeighborhoodCollgCr   -5.692e-03  4.752e-02  -0.120 0.904687    
## NeighborhoodCrawfor    1.815e-01  5.721e-02   3.172 0.001583 ** 
## NeighborhoodEdwards   -4.248e-03  5.157e-02  -0.082 0.934381    
## NeighborhoodGilbert    5.458e-03  4.960e-02   0.110 0.912414    
## NeighborhoodGreens     9.118e-02  8.161e-02   1.117 0.264313    
## NeighborhoodIDOTRR     7.521e-02  6.823e-02   1.102 0.270766    
## NeighborhoodMeadowV   -4.861e-02  7.184e-02  -0.677 0.498854    
## NeighborhoodMitchel    3.263e-02  5.206e-02   0.627 0.530971    
## NeighborhoodNAmes      2.700e-03  5.119e-02   0.053 0.957945    
## NeighborhoodNPkVill    1.007e-01  1.185e-01   0.849 0.395919    
## NeighborhoodNWAmes    -2.596e-02  5.336e-02  -0.487 0.626707    
## NeighborhoodNoRidge    6.638e-02  5.465e-02   1.215 0.224984    
## NeighborhoodNridgHt    1.170e-01  4.685e-02   2.496 0.012795 *  
## NeighborhoodOldTown    1.619e-02  6.243e-02   0.259 0.795415    
## NeighborhoodSWISU      3.746e-02  6.515e-02   0.575 0.565494    
## NeighborhoodSawyer     7.070e-02  5.316e-02   1.330 0.183993    
## NeighborhoodSawyerW   -1.085e-02  4.948e-02  -0.219 0.826487    
## NeighborhoodSomerst    1.796e-01  5.343e-02   3.361 0.000823 ***
## NeighborhoodStoneBr    1.730e-01  5.168e-02   3.348 0.000862 ***
## NeighborhoodTimber     5.207e-02  5.404e-02   0.964 0.335627    
## NeighborhoodVeenker    1.007e-01  6.162e-02   1.634 0.102739    
## Condition.1Feedr      -3.234e-02  3.992e-02  -0.810 0.418148    
## Condition.1Norm        5.170e-02  3.352e-02   1.543 0.123437    
## Condition.1PosA        9.648e-02  5.832e-02   1.654 0.098573 .  
## Condition.1PosN        1.059e-01  5.822e-02   1.819 0.069392 .  
## Condition.1RRAe        1.244e-02  5.371e-02   0.232 0.816940    
## Condition.1RRAn       -2.675e-02  5.184e-02  -0.516 0.606011    
## Condition.2Feedr       2.724e-01  1.478e-01   1.843 0.065856 .  
## Condition.2Norm        3.870e-01  1.395e-01   2.774 0.005703 ** 
## Condition.2PosA        3.196e-01  1.868e-01   1.711 0.087596 .  
## Condition.2PosN       -5.597e-01  1.745e-01  -3.207 0.001407 ** 
## Condition.2RRNn        3.974e-01  1.830e-01   2.172 0.030241 *  
## Bldg.Type2fmCon       -1.569e-02  4.244e-02  -0.370 0.711766    
## Bldg.TypeDuplex       -1.703e-01  4.793e-02  -3.553 0.000409 ***
## Bldg.TypeTwnhs        -1.589e-01  3.275e-02  -4.852 1.54e-06 ***
## Bldg.TypeTwnhsE       -9.328e-02  2.261e-02  -4.126 4.18e-05 ***
## Exterior.2ndBrk Cmn    2.037e-01  1.382e-01   1.474 0.141029    
## Exterior.2ndBrkFace    3.468e-01  5.114e-02   6.781 2.71e-11 ***
## Exterior.2ndCmentBd    2.493e-01  5.390e-02   4.625 4.53e-06 ***
## Exterior.2ndHdBoard    2.538e-01  4.651e-02   5.458 6.87e-08 ***
## Exterior.2ndImStucc    2.195e-01  6.187e-02   3.548 0.000416 ***
## Exterior.2ndMetalSd    3.001e-01  4.487e-02   6.687 4.95e-11 ***
## Exterior.2ndPlywood    2.429e-01  4.804e-02   5.058 5.55e-07 ***
## Exterior.2ndStucco     2.968e-01  5.731e-02   5.178 3.00e-07 ***
## Exterior.2ndVinylSd    2.798e-01  4.586e-02   6.102 1.81e-09 ***
## Exterior.2ndWd Sdng    3.021e-01  4.597e-02   6.572 1.03e-10 ***
## Exterior.2ndWd Shng    2.706e-01  4.989e-02   5.424 8.26e-08 ***
## Exter.QualFa           1.002e-01  5.848e-02   1.713 0.087264 .  
## Exter.QualGd          -3.819e-02  2.893e-02  -1.320 0.187315    
## Exter.QualTA          -6.860e-02  3.345e-02  -2.051 0.040715 *  
## Exter.CondFa          -1.436e-01  7.869e-02  -1.825 0.068393 .  
## Exter.CondGd           3.875e-04  6.688e-02   0.006 0.995379    
## Exter.CondTA           2.412e-02  6.584e-02   0.366 0.714229    
## FoundationCBlock       1.469e-02  2.012e-02   0.730 0.465493    
## FoundationPConc        5.204e-02  2.168e-02   2.400 0.016679 *  
## FoundationSlab         4.752e-02  7.795e-02   0.610 0.542357    
## FoundationStone        2.274e-02  6.921e-02   0.328 0.742646    
## Bsmt.QualFa           -1.756e-01  3.913e-02  -4.487 8.54e-06 ***
## Bsmt.QualGd           -5.573e-02  2.061e-02  -2.704 0.007041 ** 
## Bsmt.QualNone         -1.492e-01  1.261e-01  -1.184 0.237041    
## Bsmt.QualPo            9.760e-01  1.844e-01   5.293 1.65e-07 ***
## Bsmt.QualTA           -8.146e-02  2.665e-02  -3.057 0.002330 ** 
## Bsmt.ExposureGd        5.316e-02  1.922e-02   2.766 0.005831 ** 
## Bsmt.ExposureMn       -3.019e-02  1.842e-02  -1.639 0.101605    
## Bsmt.ExposureNo       -2.574e-02  1.284e-02  -2.004 0.045448 *  
## Bsmt.ExposureNone     -7.072e-02  1.083e-01  -0.653 0.513844    
## HeatingGasW            1.701e-01  5.564e-02   3.056 0.002334 ** 
## HeatingGrav            3.124e-01  1.576e-01   1.983 0.047818 *  
## HeatingWall            1.269e-01  1.252e-01   1.014 0.311036    
## Central.AirY           1.692e-01  2.992e-02   5.656 2.34e-08 ***
## Kitchen.QualFa        -9.575e-02  4.358e-02  -2.197 0.028382 *  
## Kitchen.QualGd        -3.295e-02  2.350e-02  -1.402 0.161397    
## Kitchen.QualPo         2.561e-01  1.451e-01   1.764 0.078136 .  
## Kitchen.QualTA        -5.048e-02  2.663e-02  -1.896 0.058441 .  
## FunctionalMaj2         2.617e-02  1.247e-01   0.210 0.833866    
## FunctionalMin1         4.461e-02  8.103e-02   0.550 0.582197    
## FunctionalMin2         8.289e-02  7.833e-02   1.058 0.290368    
## FunctionalMod          5.267e-02  8.112e-02   0.649 0.516385    
## FunctionalSal         -2.650e-01  1.558e-01  -1.701 0.089496 .  
## FunctionalTyp          1.222e-01  7.334e-02   1.666 0.096249 .  
## Garage.TypeAttchd      3.516e-02  4.443e-02   0.791 0.429055    
## Garage.TypeBasment    -2.603e-02  6.703e-02  -0.388 0.697870    
## Garage.TypeBuiltIn    -3.429e-02  4.833e-02  -0.710 0.478217    
## Garage.TypeDetchd      1.337e-02  4.555e-02   0.293 0.769302    
## Garage.QualFa         -1.400e-01  1.149e-01  -1.219 0.223333    
## Garage.QualGd         -9.882e-02  1.274e-01  -0.776 0.438054    
## Garage.QualPo         -7.601e-01  1.756e-01  -4.328 1.75e-05 ***
## Garage.QualTA         -1.485e-01  1.126e-01  -1.318 0.187845    
## Garage.CondFa         -1.207e-01  3.542e-02  -3.407 0.000698 ***
## Garage.CondGd          8.017e-02  7.719e-02   1.039 0.299404    
## Garage.CondPo          2.909e-01  6.710e-02   4.336 1.69e-05 ***
## Garage.CondTA                 NA         NA      NA       NA    
## Sale.TypeCWD           6.867e-02  7.013e-02   0.979 0.327828    
## Sale.TypeCon           3.059e-02  5.939e-02   0.515 0.606642    
## Sale.TypeConLD         1.235e-01  6.102e-02   2.024 0.043427 *  
## Sale.TypeConLI        -7.701e-02  7.441e-02  -1.035 0.301046    
## Sale.TypeConLw         1.622e-02  5.447e-02   0.298 0.766008    
## Sale.TypeNew          -6.809e-02  7.743e-02  -0.879 0.379531    
## Sale.TypeOth           1.952e-01  8.477e-02   2.303 0.021583 *  
## Sale.TypeVWD          -2.004e-02  1.181e-01  -0.170 0.865330    
## Sale.TypeWD           -4.421e-03  2.731e-02  -0.162 0.871457    
## Sale.ConditionAlloca   1.446e-01  8.545e-02   1.692 0.091059 .  
## Sale.ConditionFamily   5.949e-03  3.708e-02   0.160 0.872575    
## Sale.ConditionNormal   7.821e-02  1.881e-02   4.157 3.66e-05 ***
## Sale.ConditionPartial  1.608e-01  7.317e-02   2.198 0.028320 *  
## age                   -2.179e-03  4.728e-04  -4.608 4.90e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1064 on 643 degrees of freedom
## Multiple R-squared:  0.9487, Adjusted R-squared:  0.9378 
## F-statistic: 86.81 on 137 and 643 DF,  p-value: < 2.2e-16

Since BIC returns Overall.Qual as the significant predictor, we’ll ignore BIC model and use the AIC model instead.

And variables like below can be removed as the p-value with significance level of 5%

  • Garage.Type
  • Exter.Qual
  • Fireplaces
  • Lot.Frontage
  • Functional
  • Kitchen.Qual

Forward Selection of variables using AIC is done for the selection of better variables with the below code

## R CODE to do the forward selection
variables_chosen <- c()
var_count = 0
variables1 <- names(ames_train)
variables2 <- names(ames_train)

latest_AIC = 1/0

for(vartest in variables1){
    include_var = ""
    found = FALSE
    for(var in variables2){
        if(var == "price" || var %in% variables_chosen){
            next
        }
        if(var == "area"){
            var = "log(area)"
        }
        temp_variables_chosen <- variables_chosen
        temp_variables_chosen[[length(temp_variables_chosen)+1]] = var
        tempAIC = AIC(lm(as.formula(paste("log(price) ~ ", paste(c(temp_variables_chosen), collapse="+"))), data=ames_train ))
        if(tempAIC < latest_AIC){
            latest_AIC = tempAIC
            include_var = var;
            found = TRUE
        }
    }
    if(found){
        var_count = var_count + 1
        variables_chosen[[var_count]] = include_var
    }
}
print(variables_chosen)
print(latest_AIC)


> print(variables_chosen)
 [1] "Overall.Qual"   "log(area)"      "Neighborhood"   "BsmtFin.SF.1"   "Overall.Cond"   "age"            "Bldg.Type"      "Total.Bsmt.SF" 
 [9] "Sale.Condition" "Condition.2"    "Exter.Cond"     "Bsmt.Exposure"  "MS.Zoning"      "Garage.Cars"    "Exterior.1st"   "Exter.Qual"    
[17] "Condition.1"    "Bsmt.Full.Bath" "Central.Air"    "Lot.Shape"      "Garage.Cond"    "Bsmt.Qual"      "Garage.Qual"    "Kitchen.Qual"  
[25] "Garage.Yr.Blt"  "Fireplaces"     "Lot.Config"     "Heating"        "Heating.QC"     "Sale.Type"      "Kitchen.AbvGr"  "Functional"    
[33] "Year.Remod.Add" "Screen.Porch"   "Wood.Deck.SF"   "Foundation"     "X3Ssn.Porch"    "Bsmt.Cond"      "Bedroom.AbvGr"  "Paved.Drive"   
[41] "Open.Porch.SF" 
> print(latest_AIC)
[1] -1485.826

Even then, when we use it for fitting the test data it is not enough. So with the knowledge obtained from the summary based on p-value from summary(final_AIC) and the above forward selection, we narrowed it to the below model after many iterations.

aic_final_model <- stepAIC(lm(formula = log(price) ~ log(area) + Overall.Qual + X1st.Flr.SF + Kitchen.AbvGr + Garage.Cars + Enclosed.Porch +  Screen.Porch + MS.Zoning + Land.Slope + Neighborhood + Bldg.Type + Exterior.2nd + Bsmt.Qual + Central.Air + Sale.Condition + age, data = na.omit(ames_train)), direction = "backward", k=2, trace = 0, steps = 1000)
summary(aic_final_model)
## 
## Call:
## lm(formula = log(price) ~ log(area) + Overall.Qual + X1st.Flr.SF + 
##     Kitchen.AbvGr + Garage.Cars + Enclosed.Porch + Screen.Porch + 
##     MS.Zoning + Neighborhood + Bldg.Type + Exterior.2nd + Bsmt.Qual + 
##     Central.Air + Sale.Condition + age, data = na.omit(ames_train))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.19931 -0.06467 -0.00153  0.07681  0.60390 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            8.018e+00  1.954e-01  41.035  < 2e-16 ***
## log(area)              3.581e-01  2.581e-02  13.877  < 2e-16 ***
## Overall.Qual           6.522e-02  7.468e-03   8.733  < 2e-16 ***
## X1st.Flr.SF            1.350e-04  2.058e-05   6.559 1.04e-10 ***
## Kitchen.AbvGr          8.835e-02  5.786e-02   1.527 0.127204    
## Garage.Cars            4.299e-02  1.234e-02   3.484 0.000524 ***
## Enclosed.Porch         2.745e-04  1.034e-04   2.654 0.008141 ** 
## Screen.Porch           3.121e-04  9.708e-05   3.214 0.001366 ** 
## MS.ZoningFV            2.511e-01  8.501e-02   2.954 0.003238 ** 
## MS.ZoningRH            2.246e-01  9.905e-02   2.268 0.023631 *  
## MS.ZoningRL            2.660e-01  7.325e-02   3.632 0.000302 ***
## MS.ZoningRM            1.947e-01  6.743e-02   2.887 0.004001 ** 
## NeighborhoodBlueste    8.588e-02  1.086e-01   0.791 0.429272    
## NeighborhoodBrDale     7.253e-02  8.923e-02   0.813 0.416611    
## NeighborhoodBrkSide    1.105e-01  7.559e-02   1.462 0.144207    
## NeighborhoodClearCr    1.523e-01  8.993e-02   1.693 0.090863 .  
## NeighborhoodCollgCr    4.454e-02  6.156e-02   0.724 0.469558    
## NeighborhoodCrawfor    2.423e-01  7.139e-02   3.395 0.000725 ***
## NeighborhoodEdwards    2.050e-02  6.480e-02   0.316 0.751883    
## NeighborhoodGilbert    2.608e-02  6.356e-02   0.410 0.681688    
## NeighborhoodGreens     2.453e-01  1.026e-01   2.392 0.016995 *  
## NeighborhoodIDOTRR     1.559e-02  8.320e-02   0.187 0.851395    
## NeighborhoodMeadowV   -3.530e-02  9.255e-02  -0.381 0.703044    
## NeighborhoodMitchel    8.910e-02  6.597e-02   1.351 0.177276    
## NeighborhoodNAmes      5.776e-02  6.432e-02   0.898 0.369470    
## NeighborhoodNPkVill    7.785e-02  1.570e-01   0.496 0.620156    
## NeighborhoodNWAmes     1.324e-02  6.748e-02   0.196 0.844460    
## NeighborhoodNoRidge    1.672e-01  6.951e-02   2.405 0.016404 *  
## NeighborhoodNridgHt    1.884e-01  6.027e-02   3.126 0.001844 ** 
## NeighborhoodOldTown    7.102e-02  7.753e-02   0.916 0.360008    
## NeighborhoodSWISU      3.565e-02  8.204e-02   0.435 0.663973    
## NeighborhoodSawyer     8.171e-02  6.727e-02   1.215 0.224849    
## NeighborhoodSawyerW    2.703e-02  6.366e-02   0.425 0.671259    
## NeighborhoodSomerst    1.218e-01  6.812e-02   1.788 0.074248 .  
## NeighborhoodStoneBr    2.382e-01  6.626e-02   3.595 0.000347 ***
## NeighborhoodTimber     1.280e-01  6.966e-02   1.837 0.066595 .  
## NeighborhoodVeenker    1.860e-01  7.797e-02   2.385 0.017333 *  
## Bldg.Type2fmCon       -1.973e-03  5.174e-02  -0.038 0.969597    
## Bldg.TypeDuplex       -1.791e-01  6.008e-02  -2.981 0.002968 ** 
## Bldg.TypeTwnhs        -1.732e-01  3.854e-02  -4.494 8.14e-06 ***
## Bldg.TypeTwnhsE       -1.019e-01  2.550e-02  -3.997 7.09e-05 ***
## Exterior.2ndBrk Cmn    2.602e-01  1.812e-01   1.436 0.151531    
## Exterior.2ndBrkFace    3.553e-01  6.463e-02   5.497 5.37e-08 ***
## Exterior.2ndCmentBd    3.020e-01  6.674e-02   4.526 7.05e-06 ***
## Exterior.2ndHdBoard    2.788e-01  5.794e-02   4.811 1.83e-06 ***
## Exterior.2ndImStucc    2.708e-01  7.892e-02   3.432 0.000634 ***
## Exterior.2ndMetalSd    3.243e-01  5.583e-02   5.809 9.46e-09 ***
## Exterior.2ndPlywood    2.471e-01  5.962e-02   4.144 3.82e-05 ***
## Exterior.2ndStucco     4.174e-01  7.059e-02   5.913 5.19e-09 ***
## Exterior.2ndVinylSd    2.996e-01  5.688e-02   5.267 1.84e-07 ***
## Exterior.2ndWd Sdng    3.110e-01  5.752e-02   5.407 8.75e-08 ***
## Exterior.2ndWd Shng    3.173e-01  6.212e-02   5.108 4.18e-07 ***
## Bsmt.QualFa           -2.077e-01  4.643e-02  -4.473 8.96e-06 ***
## Bsmt.QualGd           -9.612e-02  2.366e-02  -4.062 5.41e-05 ***
## Bsmt.QualNone         -2.993e-01  5.225e-02  -5.729 1.49e-08 ***
## Bsmt.QualPo            2.508e-01  1.618e-01   1.550 0.121482    
## Bsmt.QualTA           -1.466e-01  3.158e-02  -4.644 4.07e-06 ***
## Central.AirY           2.411e-01  2.998e-02   8.041 3.66e-15 ***
## Sale.ConditionAlloca   1.238e-01  1.096e-01   1.130 0.258966    
## Sale.ConditionFamily  -6.395e-02  4.560e-02  -1.403 0.161192    
## Sale.ConditionNormal   7.469e-02  2.294e-02   3.255 0.001187 ** 
## Sale.ConditionPartial  9.074e-02  3.026e-02   2.998 0.002807 ** 
## age                   -2.582e-03  4.994e-04  -5.170 3.04e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1431 on 718 degrees of freedom
## Multiple R-squared:  0.8965, Adjusted R-squared:  0.8875 
## F-statistic: 100.3 on 62 and 718 DF,  p-value: < 2.2e-16

It has the Adjusted R-squared value as .8875 indicating that 88.75 of the price houses are explained by the included variables.


2.3.2 Section 3.2 Transformation

Did you decide to transform any variables? Why or why not? Explain in a few sentences.


We have transformed some variables like area, price and Lot.Area for the reasons mentioned in section 2.1.4.1 and also the adjusted R squared value will be higher.

summary(lm(formula= log(price)~Lot.Area, data = ames_train))$adj.r.squared < summary(lm(formula= log(price)~log(Lot.Area), data = ames_train))$adj.r.squared
## [1] TRUE

2.3.3 Section 3.3 Variable Interaction

Did you decide to include any variable interactions? Why or why not? Explain in a few sentences.


  • We have created a new variable age section 2.1.1 of EDA, as it will be easier to do the comparison between any numberical variables. One such relation has been shown in that section.
  • And also we removed NA values to ‘None’ for better model selection.
  • And we can check the collinearity between the variables and found to be that all the variables are not related to each other.
pairs(~Overall.Qual + Overall.Cond + BsmtFin.SF.1 + Garage.Cars + Bsmt.Qual + age + area,data=ames_train, main="Colinearity Check")


2.3.4 Section 3.4 Variable Selection

What method did you use to select the variables you included? Why did you select the method you used? Explain in a few sentences.


We’ll use the variables which we have selected in the section 3.1. But when we do the model for good data (i.e. with normal sale condition), we can see the increase in the Adjusted R-squared from 0.8875 to 0.9087. This means that without the outliers, the model can very well explain the house prices

ames_train_no_outliers <- ames_train %>% filter(Pool.QC != "None" | tolower(Sale.Condition) == "normal")
no_outlier_final_model <- stepAIC(lm(formula = log(price) ~ log(area) + Overall.Qual + X1st.Flr.SF + Kitchen.AbvGr + Garage.Cars + Enclosed.Porch +  Screen.Porch + MS.Zoning + Land.Slope + Neighborhood + Bldg.Type + Exterior.2nd + Bsmt.Qual + Central.Air + age, data = na.omit(ames_train_no_outliers)), direction = "backward", k=2, trace = 0, steps = 1000)
summary(no_outlier_final_model)
## 
## Call:
## lm(formula = log(price) ~ log(area) + Overall.Qual + X1st.Flr.SF + 
##     Garage.Cars + Screen.Porch + MS.Zoning + Land.Slope + Neighborhood + 
##     Bldg.Type + Exterior.2nd + Bsmt.Qual + Central.Air + age, 
##     data = na.omit(ames_train_no_outliers))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.56395 -0.05978  0.00161  0.06677  0.56112 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          8.045e+00  1.844e-01  43.628  < 2e-16 ***
## log(area)            3.924e-01  2.255e-02  17.400  < 2e-16 ***
## Overall.Qual         6.180e-02  6.688e-03   9.241  < 2e-16 ***
## X1st.Flr.SF          1.386e-04  1.922e-05   7.213 1.71e-12 ***
## Garage.Cars          6.011e-02  1.062e-02   5.660 2.37e-08 ***
## Screen.Porch         2.391e-04  8.587e-05   2.784 0.005546 ** 
## MS.ZoningFV          3.675e-01  8.210e-02   4.476 9.15e-06 ***
## MS.ZoningRH          2.484e-01  9.038e-02   2.749 0.006168 ** 
## MS.ZoningRL          3.524e-01  6.795e-02   5.187 2.96e-07 ***
## MS.ZoningRM          2.468e-01  6.377e-02   3.870 0.000121 ***
## Land.SlopeMod        2.701e-02  2.569e-02   1.052 0.293438    
## Land.SlopeSev        4.237e-01  1.246e-01   3.400 0.000720 ***
## NeighborhoodBlueste  1.666e-01  1.028e-01   1.620 0.105756    
## NeighborhoodBrDale   1.178e-01  9.266e-02   1.272 0.203970    
## NeighborhoodBrkSide  1.749e-01  8.286e-02   2.110 0.035252 *  
## NeighborhoodClearCr  1.967e-01  9.092e-02   2.164 0.030893 *  
## NeighborhoodCollgCr  7.891e-02  7.296e-02   1.082 0.279849    
## NeighborhoodCrawfor  2.768e-01  8.053e-02   3.438 0.000629 ***
## NeighborhoodEdwards  6.477e-02  7.572e-02   0.855 0.392670    
## NeighborhoodGilbert  6.624e-02  7.530e-02   0.880 0.379410    
## NeighborhoodGreens   2.845e-01  9.679e-02   2.939 0.003420 ** 
## NeighborhoodIDOTRR   1.556e-01  8.831e-02   1.762 0.078524 .  
## NeighborhoodMeadowV -3.253e-02  9.529e-02  -0.341 0.732979    
## NeighborhoodMitchel  1.400e-01  7.549e-02   1.855 0.064158 .  
## NeighborhoodNAmes    1.209e-01  7.517e-02   1.609 0.108146    
## NeighborhoodNPkVill  1.300e-01  1.359e-01   0.957 0.339102    
## NeighborhoodNWAmes   8.797e-02  7.697e-02   1.143 0.253531    
## NeighborhoodNoRidge  1.998e-01  7.682e-02   2.601 0.009534 ** 
## NeighborhoodNridgHt  1.988e-01  7.215e-02   2.756 0.006031 ** 
## NeighborhoodOldTown  1.524e-01  8.417e-02   1.811 0.070645 .  
## NeighborhoodSWISU    5.465e-02  8.720e-02   0.627 0.531128    
## NeighborhoodSawyer   1.289e-01  7.681e-02   1.678 0.093857 .  
## NeighborhoodSawyerW  7.075e-02  7.442e-02   0.951 0.342146    
## NeighborhoodSomerst  1.333e-01  8.047e-02   1.656 0.098250 .  
## NeighborhoodStoneBr  1.948e-01  7.829e-02   2.489 0.013099 *  
## NeighborhoodTimber   1.189e-01  7.836e-02   1.517 0.129835    
## NeighborhoodVeenker  2.335e-01  8.210e-02   2.845 0.004604 ** 
## Bldg.Type2fmCon      2.917e-03  3.570e-02   0.082 0.934904    
## Bldg.TypeDuplex     -1.302e-01  3.207e-02  -4.061 5.55e-05 ***
## Bldg.TypeTwnhs      -1.492e-01  3.415e-02  -4.370 1.47e-05 ***
## Bldg.TypeTwnhsE     -7.465e-02  2.465e-02  -3.028 0.002567 ** 
## Exterior.2ndBrk Cmn  4.004e-02  1.485e-01   0.270 0.787568    
## Exterior.2ndBrkFace  1.346e-01  5.930e-02   2.270 0.023590 *  
## Exterior.2ndCmentBd  1.808e-01  6.478e-02   2.791 0.005434 ** 
## Exterior.2ndHdBoard  1.001e-01  5.289e-02   1.893 0.058912 .  
## Exterior.2ndImStucc  9.085e-02  6.949e-02   1.307 0.191596    
## Exterior.2ndMetalSd  1.199e-01  5.132e-02   2.337 0.019802 *  
## Exterior.2ndPlywood  4.993e-02  5.417e-02   0.922 0.357037    
## Exterior.2ndStucco   2.039e-01  6.236e-02   3.269 0.001143 ** 
## Exterior.2ndVinylSd  1.121e-01  5.244e-02   2.137 0.032994 *  
## Exterior.2ndWd Sdng  1.235e-01  5.213e-02   2.369 0.018138 *  
## Exterior.2ndWd Shng  1.336e-01  5.659e-02   2.360 0.018609 *  
## Bsmt.QualFa         -1.690e-01  4.323e-02  -3.910 0.000103 ***
## Bsmt.QualGd         -1.042e-01  2.386e-02  -4.365 1.50e-05 ***
## Bsmt.QualNone       -2.872e-01  4.526e-02  -6.346 4.44e-10 ***
## Bsmt.QualPo          1.961e-01  1.356e-01   1.446 0.148628    
## Bsmt.QualTA         -1.406e-01  2.953e-02  -4.763 2.41e-06 ***
## Central.AirY         1.751e-01  2.528e-02   6.924 1.16e-11 ***
## age                 -2.544e-03  4.298e-04  -5.918 5.57e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.115 on 583 degrees of freedom
## Multiple R-squared:  0.917,  Adjusted R-squared:  0.9087 
## F-statistic: 111.1 on 58 and 583 DF,  p-value: < 2.2e-16

2.3.5 Section 3.5 Model Testing

How did testing the model on out-of-sample data affect whether or how you changed your model? Explain in a few sentences.


final_predicted_values <- exp(predict(aic_final_model))
final_residues <- ames_train$price - final_predicted_values
## Warning in ames_train$price - final_predicted_values: longer object length
## is not a multiple of shorter object length
final_rmse <- sqrt(mean(final_residues^2))
final_rmse
## [1] 114854.5

While testing the model on the out-of-sample data, the RMSE value 114854.5 proved that the model predicts very poorly the house price. This could be due to

  • Sale.Condition with all values, pushed it to be an non important variable (as we will see below in section 4.1)
  • Age of a house hasn’t been taken into account
  • Many categorial values had NAs associated to it
  • Also there are many outliers associated with the data.

Let’s run the same for the final model (with outliers)

  ames_test_transformed <- ames_transform(ames_test)
  ames_test_transformed$Sale.Condition <- ames_test$Sale.Condition

    model_var_level <- names(aic_final_model$xlevels)
    aic_final_model$xlevels <- sapply(model_var_level, function(x) union(aic_final_model$xlevels[[x]], levels(ames_test_transformed[[x]])))
    predict_test_data <- exp(predict(aic_final_model, newdata=ames_test_transformed))
## Warning in predict.lm(aic_final_model, newdata = ames_test_transformed):
## prediction from a rank-deficient fit may be misleading
    final_test_residue <- ames_test_transformed$price - predict_test_data
    final_test_rmse <- sqrt(mean(final_test_residue^2, na.rm = TRUE))
    print(final_test_rmse)
## [1] 44150.01

The value of 44150.01 proves the model is very much improved.


2.4 Part 4 Final Model Assessment

2.4.1 Section 4.1 Final Model Residual

Right Skewness causes a nearly not-normal distribution with left half having less data. (check the below histogram)

histogram(final_test_residue)

This is explained by

  • In the “Residual vs Fitted” plot, the first part of red line is just above 0 explaining the skewness. After that it is well maintained near zero.
  • Also Normal Q-Q shows the problem due to the outliers
plot(no_outlier_final_model)
## Warning: not plotting observations with leverage one:
##   78, 127, 241

## Warning: not plotting observations with leverage one:
##   78, 127, 241

2.4.2 Section 4.2 Final Model RMSE

For your final model, calculate and briefly comment on the RMSE.


When we try to find the RMSE for the no-outlier model, the value is decreased, indicating that we can use it for the prediction of non-outliers.

    model_var_level <- names(no_outlier_final_model$xlevels)
    no_outlier_final_model$xlevels <- sapply(model_var_level, function(x) union(no_outlier_final_model$xlevels[[x]], levels(ames_test_transformed[[x]])))
    predict_test_new_data <- exp(predict(no_outlier_final_model, newdata=ames_test_transformed))
## Warning in predict.lm(no_outlier_final_model, newdata =
## ames_test_transformed): prediction from a rank-deficient fit may be
## misleading
    final_test_residue_no_outlier <- ames_test_transformed$price - predict_test_new_data
    final_test_rmse_no_outlier <- sqrt(mean(final_test_residue_no_outlier^2, na.rm = TRUE))
    print(final_test_rmse_no_outlier)
## [1] 38312.02

The below plot shows that type 2 (of final_test_residue_no_outlier) is well around 0 compared to the type 1 (of final_residues). The residues are much less scatterd than the training data compared to the plot which we have seen in Section 2.5

This shows the good sign of the model

residual_comp1 <- data.frame(index<- seq(from=1, to=dim(ames_train)[1], by=1), residue <- final_residues, type <- rep(1,dim(ames_train)[1]))
residual_comp2 <- data.frame(index<- seq(from=1, to=dim(ames_test_transformed)[1], by=1), residue <- final_test_residue_no_outlier, type <- rep(2,dim(ames_test_transformed)[1]))
colnames(residual_comp1) <- c("index", "residue", "type")
colnames(residual_comp2) <- c("index", "residue", "type")
initial_residual_comparison <- rbind(residual_comp1, residual_comp2)
ggplot(initial_residual_comparison, aes(x=index, y=residue)) + geom_point(aes(color=type, alpha=0.5))

final_interval_prediction <- exp(predict(no_outlier_final_model, interval="prediction", newdata=ames_test_transformed))
## Warning in predict.lm(no_outlier_final_model, interval = "prediction",
## newdata = ames_test_transformed): prediction from a rank-deficient fit may
## be misleading
percentage_covered <- mean(ames_test_transformed$price > final_interval_prediction[,"lwr"] & ames_test_transformed$price < final_interval_prediction[,"upr"], na.rm = TRUE)
percentage_covered
## [1] 0.8286414

82.86% of the housing prices are well reasoned by the variables.


2.4.3 Section 4.3 Final Model Evaluation

What are some strengths and weaknesses of your model?


2.4.3.1 Section 4.3.1 Strengths

  • 95 % Prediction interval is narrow.
  • Very few outliers and on removing them lead to smaller RMSE and very strong model to validate other data.
  • The model was build by automated AIC functions and knowledge gathered from manual written Forward selection of variables with AIC. This lead to well computed model.

2.4.3.2 Section 4.3.2 Weakness

  • With outliers, the model could predict wrong values
  • The coverage of 82.86% can be well increased as we are losing around 17.14% explanation
  • Final RMSE 38312.02 is still high.

2.4.4 Section 4.4 Final Model Validation

Testing your final model on a separate, validation data set is a great way to determine how your model will perform in real-life practice.

You will use the “ames_validation” dataset to do some additional assessment of your final model. Discuss your findings, be sure to mention: * What is the RMSE of your final model when applied to the validation data?
* How does this value compare to that of the training data and/or testing data? * What percentage of the 95% predictive confidence (or credible) intervals contain the true price of the house in the validation data set?
* From this result, does your final model properly reflect uncertainty?

load("ames_validation.Rdata")

ames_validation_transformed <- ames_transform(ames_validation)
ames_validation_transformed$Sale.Condition <- ames_validation$Sale.Condition

model_var_level <- names(no_outlier_final_model$xlevels)
no_outlier_final_model$xlevels <- sapply(model_var_level, function(x) union(no_outlier_final_model$xlevels[[x]], levels(ames_validation[[x]])))
predict_validation_data <- exp(predict(no_outlier_final_model, newdata=ames_validation_transformed))
## Warning in predict.lm(no_outlier_final_model, newdata =
## ames_validation_transformed): prediction from a rank-deficient fit may be
## misleading
validation_residue <- ames_validation$price - predict_validation_data
validation_rmse <- sqrt(mean(validation_residue^2, na.rm = TRUE))
validation_rmse
## [1] 51398.32

The RMSE of 51398.32 shows that the model can still be improved. This is due the outliers present as below

plot(ames_validation_transformed$price)

Let’s remove the most priced 20 data and try again

ames_validation_transformed2 <- ames_validation_transformed[-head(order(ames_validation_transformed$price, decreasing = T), n=20),]
model_var_level <- names(no_outlier_final_model$xlevels)
no_outlier_final_model$xlevels <- sapply(model_var_level, function(x) union(no_outlier_final_model$xlevels[[x]], levels(ames_validation_transformed2[[x]])))
predict_validation_data <- exp(predict(no_outlier_final_model, newdata=ames_validation_transformed2))
## Warning in predict.lm(no_outlier_final_model, newdata =
## ames_validation_transformed2): prediction from a rank-deficient fit may be
## misleading
validation_residue <- ames_validation_transformed2$price - predict_validation_data
validation_rmse <- sqrt(mean(validation_residue^2, na.rm = TRUE))
validation_rmse
## [1] 47149.05

Now the value is decreased to 47149.05 as a good sign.

Also check the residue comparison plot. It shows that the predicted residues are in the same disperse level as the training data.

residual_comp1 <- data.frame(index<- seq(from=1, to=dim(ames_train)[1], by=1), residue <- final_residues, type <- rep(1,dim(ames_train)[1]))
residual_comp2 <- data.frame(index<- seq(from=1, to=dim(ames_validation_transformed2)[1], by=1), residue <- validation_residue, type <- rep(2,dim(ames_validation_transformed2)[1]))
colnames(residual_comp1) <- c("index", "residue", "type")
colnames(residual_comp2) <- c("index", "residue", "type")
initial_residual_comparison <- rbind(residual_comp1, residual_comp2)
ggplot(initial_residual_comparison, aes(x=index, y=residue)) + geom_point(aes(color=type, alpha=0.5))

validation_interval_prediction <- exp(predict(no_outlier_final_model, interval="prediction", newdata=ames_validation_transformed2))
## Warning in predict.lm(no_outlier_final_model, interval = "prediction",
## newdata = ames_validation_transformed2): prediction from a rank-deficient
## fit may be misleading
percentage_covered <- mean(ames_validation_transformed2$price > validation_interval_prediction[,"lwr"] & ames_validation_transformed2$price < validation_interval_prediction[,"upr"], na.rm = TRUE)
percentage_covered
## [1] 0.7648721

76.48% of the housing prices are well reasoned by the variables.


2.5 Part 5 Conclusion

Provide a brief summary of your results, and a brief discussion of what you have learned about the data and your model.


Overvalued vs Undervalues price vs area comparison. The values are determined from the IQR percentile of the house prices

undervalued = summary(ames_validation_transformed$price)[[2]]
undervalued_data <- ames_validation_transformed2[which(ames_validation_transformed$price <= undervalued),] %>% select(area,price,Neighborhood)
undervalued_data$type <- "undervalued"
colnames(undervalued_data) <- c("area", "price", "Neighborhood", "type")
overvalued = summary(ames_validation_transformed$price)[[5]]
overvalued_data <- ames_validation_transformed2[which(ames_validation_transformed$price >= overvalued),] %>% select(area,price,Neighborhood)
overvalued_data$type <- "overvalued"
colnames(overvalued_data) <- c("area", "price", "Neighborhood", "type")
misfit <- rbind(overvalued_data,undervalued_data)
suppressWarnings(ggplot(data=misfit, aes(x=log(area), y=log(price), color=factor(type), shape=factor(type))) + geom_point())
## Warning: Removed 8 rows containing missing values (geom_point).

Count of undervalued and overvalues home in Neighborhood

misfit %>% filter(!is.na(price)) %>% group_by(Neighborhood) %>% summarise(undervalued_count = sum(ifelse(type == "undervalued", 1, 0))) %>% arrange(desc(undervalued_count)) %>% slice(1:10)
## # A tibble: 10 x 2
##    Neighborhood undervalued_count
##    <fct>                    <dbl>
##  1 NAmes                       38
##  2 OldTown                     18
##  3 Sawyer                      15
##  4 CollgCr                     12
##  5 Edwards                     12
##  6 Gilbert                     12
##  7 BrkSide                      9
##  8 Mitchel                      9
##  9 IDOTRR                       8
## 10 NWAmes                       7
misfit %>% filter(!is.na(price)) %>% group_by(Neighborhood) %>% summarise(overvalued_count = sum(ifelse(type == "overvalued", 1, 0))) %>% arrange(desc(overvalued_count)) %>% slice(1:10)
## # A tibble: 10 x 2
##    Neighborhood overvalued_count
##    <fct>                   <dbl>
##  1 CollgCr                    20
##  2 NAmes                      20
##  3 OldTown                    19
##  4 NWAmes                     13
##  5 NridgHt                    13
##  6 Gilbert                    11
##  7 Sawyer                     11
##  8 Edwards                    10
##  9 Mitchel                    10
## 10 BrkSide                     9

2.5.1 Findings

  • Model which we have derived can still be improved by removing more unwanted values as we witnessed through out the document and that will also increase the probability coverage
  • Even though we thought that BIC was good due to apply of the penalty, it turned out that AIC was better in finding the model
  • If we have many variables, bas.lm was not usable because of 2^n combinations and we have step in for finding the best model. Also we need to take care of Forwards/Backward selection of variables using adj-R-squared/BIC/AIC (whatever be the types) as it involves more variables
  • Three main variables are area, Overall.Qual, MS.Zoning and age
  • log of the variables area, price has to be applied to make sure that the model is good
  • As the model is fit based on the skewness of the residual distribution, the model would behave differently for different data set
  • Normal distribution helps in the appropriate finding of the response variable